Overview

Dataset statistics

Number of variables: 12
Number of observations: 39948
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 22
Duplicate rows (%): 0.1%
Total size in memory: 3.7 MiB
Average record size in memory: 96.0 B
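
The memory figures above are mutually consistent, which can be checked with a few lines of arithmetic. The sketch below assumes (it is not stated in the report) that each of the 12 variables is stored as an 8-byte value; under that assumption it reproduces the 96.0 B record size, the 3.7 MiB total, and the 312.1 KiB reported per variable further down.

```python
# Figures copied from the "Dataset statistics" block.
n_variables = 12
n_observations = 39948
avg_record_bytes = 96.0

# Assumption: every column holds one 8-byte value per row.
bytes_per_value = avg_record_bytes / n_variables

# Total size in MiB and per-column size in KiB, derived from the record size.
total_mib = avg_record_bytes * n_observations / 2**20
column_kib = bytes_per_value * n_observations / 2**10

print(f"bytes per value: {bytes_per_value}")
print(f"total size: {total_mib:.1f} MiB")
print(f"size per column: {column_kib:.1f} KiB")
```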

Variable types

NUM: 9
CAT: 3

Reproduction

Analysis started: 2021-03-11 01:58:56.458074
Analysis finished: 2021-03-11 01:59:13.133202
Duration: 16.68 seconds
Version: pandas-profiling v2.8.0
Command line: pandas_profiling --config_file config.yaml [YOUR_FILE.csv]
Download configuration: config.yaml
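
Besides the command line shown above, a comparable report can usually be generated from Python. The following is a minimal sketch, assuming pandas and pandas-profiling are installed; the function name `build_report`, the default title, and the file paths are placeholders for illustration and do not come from this report.

```python
def build_report(csv_path, html_path, title="Profiling report"):
    """Profile a CSV file and write the HTML report to html_path."""
    # Imports are kept local so this module loads even when the
    # optional packages are not installed.
    import pandas as pd
    from pandas_profiling import ProfileReport

    df = pd.read_csv(csv_path)
    profile = ProfileReport(df, title=title)
    profile.to_file(html_path)
    return html_path
```

A call such as `build_report("YOUR_FILE.csv", "report.html")` would then play the role of the command line above, without the custom `config.yaml`.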

Warnings

Dataset has 22 (0.1%) duplicate rows [Duplicates]
impression is highly skewed (γ1 = 159.0616115) [Skewed]
query_id has 571 (1.4%) zeros [Zeros]
keyword_id has 570 (1.4%) zeros [Zeros]
user_id has 9633 (24.1%) zeros [Zeros]
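
The percentages in these warnings follow directly from the counts and the 39948 observations. A small sketch of that arithmetic, with one-decimal rounding assumed to match the report's display:

```python
n_observations = 39948

# Counts copied from the warnings above.
duplicate_rows = 22
zero_counts = {"query_id": 571, "keyword_id": 570, "user_id": 9633}

duplicate_pct = round(100 * duplicate_rows / n_observations, 1)
zero_pct = {
    name: round(100 * count / n_observations, 1)
    for name, count in zero_counts.items()
}

print("duplicates:", duplicate_pct)
for name, pct in zero_pct.items():
    print(name, pct)
```

Note that 22 of 39948 rows is about 0.055%, which only shows as 0.1% because of the rounding.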

Variables

click
Categorical

Distinct count: 2
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 312.1 KiB

Value  Count  Frequency (%)
0      33220  83.2%
1      6728   16.8%

Length

Max length: 1
Median length: 1
Mean length: 1
Min length: 1

impression
Real number (ℝ≥0)

SKEWED

Distinct count: 99
Unique (%): 0.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 2.100205266846901
Minimum: 1
Maximum: 11820
Zeros: 0
Zeros (%): 0.0%
Memory size: 312.1 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
Median: 1
Q3: 1
95-th percentile: 3
Maximum: 11820
Range: 11819
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 65.86738287
Coefficient of variation (CV): 31.36235486
Kurtosis: 27130.68877
Mean: 2.100205267
Median Absolute Deviation (MAD): 0
Skewness: 159.0616115
Sum: 83899
Variance: 4338.512126
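
Several of these descriptive statistics are derived from one another: the coefficient of variation is the standard deviation divided by the mean, and the variance is the squared standard deviation. The sketch below checks those two relations against the reported values and, on a small made-up sample, shows how a single large value produces strong positive skewness. The `describe` helper and the toy data are illustrative assumptions; in particular it uses the simple moment-based skewness, whereas the report may use a bias-corrected estimator.

```python
import math

def describe(values):
    """Mean, sample standard deviation, CV and moment-based skewness."""
    n = len(values)
    mean = sum(values) / n
    squared = [(v - mean) ** 2 for v in values]
    std = math.sqrt(sum(squared) / (n - 1))       # sample std (ddof=1)
    m2 = sum(squared) / n                          # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return {"mean": mean, "std": std, "cv": std / mean, "skewness": m3 / m2 ** 1.5}

# Relations checked against the reported figures for `impression`.
reported_mean = 2.100205267
reported_std = 65.86738287
cv_from_report = reported_std / reported_mean
variance_from_report = reported_std ** 2

# Toy data (not from the dataset): mostly ones plus a single extreme value.
toy = [1] * 95 + [2, 2, 3, 3, 500]
toy_stats = describe(toy)

print(cv_from_report, variance_from_report)
print(toy_stats)
```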
Histogram with fixed size bins (bins=10)

Common values

Value              Count  Frequency (%)
1                  33846  84.7%
2                  3670   9.2%
3                  1032   2.6%
4                  444    1.1%
5                  222    0.6%
6                  158    0.4%
7                  105    0.3%
8                  78     0.2%
10                 38     0.1%
9                  33     0.1%
Other values (89)  322    0.8%

Minimum 5 values

Value  Count  Frequency (%)
1      33846  84.7%
2      3670   9.2%
3      1032   2.6%
4      444    1.1%
5      222    0.6%

Maximum 5 values

Value  Count  Frequency (%)
11820  1      < 0.1%
5467   1      < 0.1%
1009   1      < 0.1%
584    1      < 0.1%
490    1      < 0.1%

url_hash
Real number (ℝ≥0)

Distinct count: 6941
Unique (%): 17.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 9.641350419145736e+18
Minimum: 482436910553333.0
Maximum: 1.844094316957687e+19
Zeros: 0
Zeros (%): 0.0%
Memory size: 312.1 KiB

Quantile statistics

Minimum: 4.824369106e+14
5-th percentile: 1.33593611e+18
Q1: 5.468727571e+18
Median: 1.034946865e+19
Q3: 1.434039016e+19
95-th percentile: 1.702769257e+19
Maximum: 1.844094317e+19
Range: 1.844046073e+19
Interquartile range (IQR): 8.871662586e+18

Descriptive statistics

Standard deviation: 4.98670453e+18
Coefficient of variation (CV): 0.5172205462
Kurtosis: -1.115120878
Mean: 9.641350419e+18
Median Absolute Deviation (MAD): 3.990921506e+18
Skewness: -0.2372171121
Sum: 3.851526665e+23
Variance: 2.486722207e+37
Histogram with fixed size bins (bins=10)

Common values

Value                Count  Frequency (%)
1.434039016e+19      3891   9.7%
1.2057879e+19        3381   8.5%
7.903914528e+18      1107   2.8%
4.298118681e+18      421    1.1%
1.453186765e+19      395    1.0%
1.375625754e+19      394    1.0%
5.851252814e+18      365    0.9%
1.475657876e+19      295    0.7%
1.514548016e+19      295    0.7%
2.69285962e+18       287    0.7%
Other values (6931)  29117  72.9%

Minimum 5 values

Value            Count  Frequency (%)
4.824369106e+14  1      < 0.1%
1.234866104e+15  2      < 0.1%
2.068449938e+15  3      < 0.1%
1.711420047e+16  1      < 0.1%
2.416657436e+16  1      < 0.1%

Maximum 5 values

Value            Count  Frequency (%)
1.844094317e+19  3      < 0.1%
1.843882671e+19  7      < 0.1%
1.843731353e+19  2      < 0.1%
1.843728766e+19  2      < 0.1%
1.843108375e+19  1      < 0.1%
 

ad_id
Real number (ℝ≥0)

Distinct count: 19228
Unique (%): 48.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 16016715.903249225
Minimum: 1000515
Maximum: 22227340
Zeros: 0
Zeros (%): 0.0%
Memory size: 312.1 KiB

Quantile statistics

Minimum: 1000515
5-th percentile: 3066490
Q1: 9027238
Median: 20303729.5
Q3: 21163923
95-th percentile: 21872897.7
Maximum: 22227340
Range: 21226825
Interquartile range (IQR): 12136685
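
The range and the interquartile range are simple differences of the quantiles listed above, and fractional quantiles such as a median of 20303729.5 arise when a quantile falls between two observations. The sketch below illustrates both points. It assumes linear interpolation between neighbouring order statistics, a common default that the report does not state explicitly, and the `quantile` helper and the toy list are illustrative only.

```python
import math

def quantile(sorted_values, q):
    """Quantile by linear interpolation between the two nearest order statistics."""
    position = q * (len(sorted_values) - 1)
    lower = math.floor(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])

# Toy data: with an even number of values the median lies between two of them.
toy = [1, 2, 3, 4]
toy_median = quantile(toy, 0.5)
toy_q1 = quantile(toy, 0.25)

# Range and IQR recomputed from the reported ad_id quantiles.
ad_id_min, ad_id_q1, ad_id_q3, ad_id_max = 1000515, 9027238, 21163923, 22227340
ad_id_range = ad_id_max - ad_id_min
ad_id_iqr = ad_id_q3 - ad_id_q1

print(toy_median, toy_q1, ad_id_range, ad_id_iqr)
```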

Descriptive statistics

Standard deviation: 7222259.539
Coefficient of variation (CV): 0.4509201251
Kurtosis: -1.022447811
Mean: 16016715.9
Median Absolute Deviation (MAD): 1021970.5
Skewness: -0.8821406758
Sum: 6.398357669e+11
Variance: 5.216103284e+13
Histogram with fixed size bins (bins=10)

Common values

Value                 Count  Frequency (%)
9027213               355    0.9%
20192676              221    0.6%
21522776              216    0.5%
20908196              207    0.5%
3048011               153    0.4%
20644045              152    0.4%
21163923              140    0.4%
3065545               127    0.3%
20017078              117    0.3%
20030165              110    0.3%
Other values (19218)  38150  95.5%

Minimum 5 values

Value    Count  Frequency (%)
1000515  2      < 0.1%
1000699  2      < 0.1%
1000806  2      < 0.1%
1000829  1      < 0.1%
1000830  1      < 0.1%

Maximum 5 values

Value     Count  Frequency (%)
22227340  2      < 0.1%
22227123  1      < 0.1%
22227066  1      < 0.1%
22226792  1      < 0.1%
22226685  1      < 0.1%
 

advertiser_id
Real number (ℝ≥0)

Distinct count: 6064
Unique (%): 15.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 22454.496545509162
Minimum: 82
Maximum: 39074
Zeros: 0
Zeros (%): 0.0%
Memory size: 312.1 KiB

Quantile statistics

Minimum: 82
5-th percentile: 1268
Q1: 13476.5
Median: 23808
Q3: 32124
95-th percentile: 37422
Maximum: 39074
Range: 38992
Interquartile range (IQR): 18647.5

Descriptive statistics

Standard deviation: 11796.0858
Coefficient of variation (CV): 0.5253329004
Kurtosis: -0.8357317704
Mean: 22454.49655
Median Absolute Deviation (MAD): 9024
Skewness: -0.6158151524
Sum: 897012228
Variance: 139147640.2
Histogram with fixed size bins (bins=10)

Common values

Value                Count  Frequency (%)
27961                3363   8.4%
23808                2420   6.1%
23777                1592   4.0%
1325                 790    2.0%
23778                481    1.2%
1268                 448    1.1%
385                  421    1.1%
23807                395    1.0%
28698                365    0.9%
24354                356    0.9%
Other values (6054)  29317  73.4%

Minimum 5 values

Value  Count  Frequency (%)
82     29     0.1%
85     1      < 0.1%
87     1      < 0.1%
88     3      < 0.1%
94     2      < 0.1%

Maximum 5 values

Value  Count  Frequency (%)
39074  2      < 0.1%
38970  1      < 0.1%
38961  1      < 0.1%
38956  1      < 0.1%
38942  1      < 0.1%
 

depth
Categorical

Distinct count: 3
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 312.1 KiB

Value  Count  Frequency (%)
2      19439  48.7%
1      11053  27.7%
3      9456   23.7%

Length

Max length: 1
Median length: 1
Mean length: 1
Min length: 1

position
Categorical

Distinct count: 3
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 312.1 KiB

Value  Count  Frequency (%)
1      24417  61.1%
2      12532  31.4%
3      2999   7.5%

Length

Max length: 1
Median length: 1
Mean length: 1
Min length: 1

query_id
Real number (ℝ≥0)

ZEROS

Distinct count: 30748
Unique (%): 77.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3142145.502903775
Minimum: 0
Maximum: 26240100
Zeros: 571
Zeros (%): 1.4%
Memory size: 312.1 KiB

Quantile statistics

Minimum: 0
5-th percentile: 8
Q1: 2364.25
Median: 112836.5
Q3: 3147909.25
95-th percentile: 17386650.7
Maximum: 26240100
Range: 26240100
Interquartile range (IQR): 3145545

Descriptive statistics

Standard deviation: 5841539.586
Coefficient of variation (CV): 1.859092642
Kurtosis: 3.859580544
Mean: 3142145.503
Median Absolute Deviation (MAD): 112830.5
Skewness: 2.151205504
Sum: 1.255224286e+11
Variance: 3.412358473e+13
Histogram with fixed size bins (bins=10)

Common values

Value                 Count  Frequency (%)
0                     571    1.4%
1                     414    1.0%
2                     217    0.5%
4                     176    0.4%
8                     169    0.4%
5                     164    0.4%
3                     154    0.4%
6                     140    0.4%
7                     120    0.3%
15                    83     0.2%
Other values (30738)  37740  94.5%

Minimum 5 values

Value  Count  Frequency (%)
0      571    1.4%
1      414    1.0%
2      217    0.5%
3      154    0.4%
4      176    0.4%

Maximum 5 values

Value     Count  Frequency (%)
26240100  1      < 0.1%
26231458  1      < 0.1%
26230009  1      < 0.1%
26229427  1      < 0.1%
26229194  1      < 0.1%
 

keyword_id
Real number (ℝ≥0)

ZEROS

Distinct count: 19803
Unique (%): 49.6%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 35194.431360769
Minimum: 0
Maximum: 1243163
Zeros: 570
Zeros (%): 1.4%
Memory size: 312.1 KiB

Quantile statistics

Minimum: 0
5-th percentile: 9
Q1: 370
Median: 3389
Q3: 21030
95-th percentile: 178425.05
Maximum: 1243163
Range: 1243163
Interquartile range (IQR): 20660

Descriptive statistics

Standard deviation: 100914.8155
Coefficient of variation (CV): 2.867351784
Kurtosis: 44.26949755
Mean: 35194.43136
Median Absolute Deviation (MAD): 3355
Skewness: 5.862208181
Sum: 1405947144
Variance: 1.01838e+10
Histogram with fixed size bins (bins=10)

Common values

Value                 Count  Frequency (%)
0                     570    1.4%
1                     383    1.0%
2                     182    0.5%
8                     164    0.4%
3                     159    0.4%
6                     152    0.4%
4                     150    0.4%
10                    140    0.4%
5                     125    0.3%
9                     116    0.3%
Other values (19793)  37807  94.6%

Minimum 5 values

Value  Count  Frequency (%)
0      570    1.4%
1      383    1.0%
2      182    0.5%
3      159    0.4%
4      150    0.4%

Maximum 5 values

Value    Count  Frequency (%)
1243163  1      < 0.1%
1242910  1      < 0.1%
1240995  1      < 0.1%
1234956  1      < 0.1%
1231769  1      < 0.1%
 

title_id
Real number (ℝ≥0)

Distinct count: 25321
Unique (%): 63.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 173282.90677881247
Minimum: 0
Maximum: 4050208
Zeros: 355
Zeros (%): 0.9%
Memory size: 312.1 KiB

Quantile statistics

Minimum: 0
5-th percentile: 15
Q1: 670.75
Median: 10654
Q3: 100289.5
95-th percentile: 959922.65
Maximum: 4050208
Range: 4050208
Interquartile range (IQR): 99618.75

Descriptive statistics

Standard deviation: 465674.7878
Coefficient of variation (CV): 2.68736713
Kurtosis: 24.91604281
Mean: 173282.9068
Median Absolute Deviation (MAD): 10626
Skewness: 4.604918963
Sum: 6922305560
Variance: 2.16853008e+11
Histogram with fixed size bins (bins=10)

Common values

Value                 Count  Frequency (%)
0                     355    0.9%
4                     168    0.4%
2                     167    0.4%
1                     152    0.4%
3                     135    0.3%
5                     126    0.3%
8                     114    0.3%
7                     114    0.3%
9                     113    0.3%
6                     111    0.3%
Other values (25311)  38393  96.1%

Minimum 5 values

Value  Count  Frequency (%)
0      355    0.9%
1      152    0.4%
2      167    0.4%
3      135    0.3%
4      168    0.4%

Maximum 5 values

Value    Count  Frequency (%)
4050208  1      < 0.1%
4039102  1      < 0.1%
4028916  1      < 0.1%
4028814  1      < 0.1%
4027734  1      < 0.1%
 

description_id
Real number (ℝ≥0)

Distinct count: 22381
Unique (%): 56.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 111150.89986983078
Minimum: 0
Maximum: 3171504
Zeros: 355
Zeros (%): 0.9%
Memory size: 312.1 KiB

Quantile statistics

Minimum: 0
5-th percentile: 12
Q1: 356
Median: 5048
Q3: 52861.75
95-th percentile: 597557.2
Maximum: 3171504
Range: 3171504
Interquartile range (IQR): 52505.75

Descriptive statistics

Standard deviation: 328374.223
Coefficient of variation (CV): 2.954310072
Kurtosis: 32.13248565
Mean: 111150.8999
Median Absolute Deviation (MAD): 5025
Skewness: 5.185337063
Sum: 4440256148
Variance: 1.078296304e+11
Histogram with fixed size bins (bins=10)

Common values

Value                 Count  Frequency (%)
0                     355    0.9%
1                     190    0.5%
5                     167    0.4%
2                     159    0.4%
4                     154    0.4%
3                     152    0.4%
6                     146    0.4%
9                     143    0.4%
7                     135    0.3%
8                     129    0.3%
Other values (22371)  38218  95.7%

Minimum 5 values

Value  Count  Frequency (%)
0      355    0.9%
1      190    0.5%
2      159    0.4%
3      152    0.4%
4      154    0.4%

Maximum 5 values

Value    Count  Frequency (%)
3171504  1      < 0.1%
3169356  1      < 0.1%
3162659  1      < 0.1%
3153679  1      < 0.1%
3152468  1      < 0.1%
 

user_id
Real number (ℝ≥0)

ZEROS

Distinct count: 30114
Unique (%): 75.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3669622.261865425
Minimum: 0
Maximum: 23907337
Zeros: 9633
Zeros (%): 24.1%
Memory size: 312.1 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 1472.25
Median: 888386.5
Q3: 5129631.25
95-th percentile: 16392696.65
Maximum: 23907337
Range: 23907337
Interquartile range (IQR): 5128159

Descriptive statistics

Standard deviation: 5492058.472
Coefficient of variation (CV): 1.49662774
Kurtosis: 2.398862313
Mean: 3669622.262
Median Absolute Deviation (MAD): 888386.5
Skewness: 1.777911493
Sum: 1.465940701e+11
Variance: 3.016270626e+13
Histogram with fixed size bins (bins=10)

Common values

Value                 Count  Frequency (%)
0                     9633   24.1%
2                     6      < 0.1%
187                   4      < 0.1%
124                   3      < 0.1%
229                   3      < 0.1%
61                    3      < 0.1%
125                   3      < 0.1%
154                   3      < 0.1%
52                    3      < 0.1%
56                    3      < 0.1%
Other values (30104)  30284  75.8%

Minimum 5 values

Value  Count  Frequency (%)
0      9633   24.1%
1      1      < 0.1%
2      6      < 0.1%
4      1      < 0.1%
5      2      < 0.1%

Maximum 5 values

Value     Count  Frequency (%)
23907337  1      < 0.1%
23902732  1      < 0.1%
23894682  1      < 0.1%
23888301  1      < 0.1%
23884533  1      < 0.1%
 

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1: -1 indicates total negative linear correlation, 0 indicates no linear correlation, and +1 indicates total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, which implies that for a linear relationship the angle of the line to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
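
The calculation described above can be sketched in plain Python as follows. The function name `pearson_r` and the sample data are illustrative and not part of the report; the common 1/n factors in the covariance and the standard deviations cancel, so they are omitted.

```python
import math

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

r = pearson_r(x, y)
# Shifting and positively rescaling either variable leaves r unchanged.
r_transformed = pearson_r([10 * a - 3 for a in x], [0.5 * b + 7 for b in y])
print(r, r_transformed)
```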

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1: -1 indicates total negative monotonic correlation, 0 indicates no monotonic correlation, and +1 indicates total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of the standard deviations of those rank variables.
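
A minimal sketch of this rank-based calculation is given below. The helper names and the example data are illustrative; tied values receive the average of the ranks they span, which is a common convention and an assumption here.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation applied to the rank variables of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return covariance / (spread_x * spread_y)

# A cubic relationship is nonlinear but perfectly monotonic.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
rho = spearman_rho(x, y)
print(rho)
```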

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1: -1 indicates total negative correlation, 0 indicates no correlation, and +1 indicates total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
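
The pair-counting definition above corresponds to the variant usually called tau-a, sketched below. The function name and the example data are illustrative. Statistical libraries often report tau-b instead, which adjusts the denominator for ties, so results can differ on data with many repeated values.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """(concordant pairs - discordant pairs) / total number of pairs."""
    concordant = 0
    discordant = 0
    pairs = list(combinations(range(len(x)), 2))
    for i, j in pairs:
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
        # direction == 0 means a tie, counted as neither
    return (concordant - discordant) / len(pairs)

# Two concordant pairs and one discordant pair out of three.
tau = kendall_tau_a([1, 2, 3], [1, 3, 2])
print(tau)
```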

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available for Phik.

Cramér's V (φc)

Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been shown to be biased, even for large samples. We use the bias-corrected measure proposed by Bergsma in 2013.
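
As an illustration of the bias correction, the sketch below computes the corrected Cramér's V from a contingency table of counts. It follows the commonly cited form of Bergsma's correction and is not taken from the report's own code; the function name and the example tables are illustrative, and the sketch assumes every row and column has at least one observation.

```python
import math

def cramers_v_corrected(table):
    """Bias-corrected Cramér's V for a contingency table (list of rows of counts)."""
    n = sum(sum(row) for row in table)
    r = len(table)
    k = len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(r)) for j in range(k)]

    # Pearson chi-squared statistic against the independence model.
    chi2 = 0.0
    for i in range(r):
        for j in range(k):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    phi2 = chi2 / n
    # Corrected phi squared and corrected table dimensions.
    phi2_corrected = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corrected = r - (r - 1) ** 2 / (n - 1)
    k_corrected = k - (k - 1) ** 2 / (n - 1)
    return math.sqrt(phi2_corrected / min(k_corrected - 1, r_corrected - 1))

independent = [[10, 20], [20, 40]]   # rows are proportional
perfect = [[50, 0], [0, 50]]         # each row maps to exactly one column
v_independent = cramers_v_corrected(independent)
v_perfect = cramers_v_corrected(perfect)
print(v_independent, v_perfect)
```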

Missing values

Sample

First rows

click impression url_hash ad_id advertiser_id depth position query_id keyword_id title_id description_id user_id
0011.071003e+19834329511700337702266212642789215590
1111.736385e+192001707723798119307935498436476562934
2018.915473e+182134835436654111098119975361053329211621116
3014.426693e+1820366086332803305942405743908778348
4011.157260e+196803526107902198819786059325242167912118311
5112.827577e+172118647835793211633154871325711532886008
6018.813903e+18208866903484022316543220628887589739
7013.811035e+18213673762066732260143911895949705579253
8019.806838e+18218117523773732163133384121755277279
9111.434039e+1990272132380821510011808635

Last rows

click impression url_hash ad_id advertiser_id depth position query_id keyword_id title_id description_id user_id
39938021.434039e+1921163921238081120912536513373840187
39939011.544609e+1921098737707311105129102737657519360322657879
39940011.637837e+19212291833586021486362646735848616651485104
39941011.146304e+19212500083536422879683647353395665027883198276
39942011.768833e+1920382992174032211881210921314167372733
39943013.593550e+18218986433786721128259391091165719140
39944011.760828e+19205755788873211169983387866921019487
39945059.613260e+1821183848187162124382695948881132772305
39946019.750423e+1821222438358803371308041307894312214360
39947011.205788e+19201802452796111216593973288617887245602668

Duplicate rows

Most frequent

click impression url_hash ad_id advertiser_id depth position query_id keyword_id title_id description_id user_id count
0011.251029e+18203465891614232117400868465380276998574312
1012.670953e+182017289023805321947149951666529302
2012.994925e+1820061287238031114650011323510673219607152
3014.298119e+18218660263851130530714735248652355631812
4015.824342e+184385495219593357727453245123249921839702
5017.903915e+1821162356132511495573236557895302
6017.903915e+1821372653201011207665627550984302
7019.317847e+182129905033746212427067188434934170255602
8011.205788e+1920157628279613300496098020642
9011.341274e+1921449769277123127598425450749531075616474032
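
A table of duplicate rows with a count column, like the one above, can be derived by counting identical records. The sketch below uses made-up rows rather than the dataset, and it assumes that the number of duplicate rows means the extra copies beyond the first occurrence, which is one common convention.

```python
from collections import Counter

# Toy records (not from the dataset); each tuple stands for one row.
rows = [
    (0, 1, "a"),
    (1, 1, "b"),
    (0, 1, "a"),
    (0, 2, "c"),
    (0, 1, "a"),
]

counts = Counter(rows)

# Rows that occur more than once, with their number of occurrences.
duplicate_groups = {row: count for row, count in counts.items() if count > 1}

# Extra copies beyond the first occurrence of each row.
extra_copies = sum(count - 1 for count in counts.values())

print(duplicate_groups)
print(extra_copies)
```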